NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Clustering single-cell RNA-seq data by rank constrained similarity learning

https://doi.org/10.1093/bioinformatics/btab276

Mei, Qinglin; Li, Guojun; Su, Zhengchang (May 2021, Bioinformatics)
Mathelier, Anthony (Ed.)
Abstract Motivation Recent breakthroughs of single-cell RNA sequencing (scRNA-seq) technologies offer an exciting opportunity to identify heterogeneous cell types in complex tissues. However, the unavoidable biological noise and technical artifacts in scRNA-seq data as well as the high dimensionality of expression vectors make the problem highly challenging. Consequently, although numerous tools have been developed, their accuracy remains to be improved. Results Here, we introduce a novel clustering algorithm and tool RCSL (Rank Constrained Similarity Learning) to accurately identify various cell types using scRNA-seq data from a complex tissue. RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types. RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbor representation of a cell as its local similarity. The overall similarity of a cell to other cells is a linear combination of its global similarity and local similarity. RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized. Each block-diagonal submatrix is a cell cluster/type, corresponding to a connected component in the cognate similarity graph. When tested on 16 benchmark scRNA-seq datasets in which the cell types are well-annotated, RCSL substantially outperformed six state-of-the-art methods in accuracy and robustness as measured by three metrics. Availability and implementation The RCSL algorithm is implemented in R and can be freely downloaded at https://cran.r-project.org/web/packages/RCSL/index.html. Supplementary information Supplementary data are available at Bioinformatics online.
more » « less
Full Text Available
RecBic: a fast and accurate algorithm recognizing trend-preserving biclusters

https://doi.org/10.1093/bioinformatics/btaa630

Liu, Xiangyu; Li, Di; Liu, Juntao; Su, Zhengchang; Li, Guojun (July 2020, Bioinformatics)
Cowen, Lenore (Ed.)
Abstract Motivation Biclustering has emerged as a powerful approach to identifying functional patterns in complex biological data. However, existing tools are limited by their accuracy and efficiency to recognize various kinds of complex biclusters submerged in ever large datasets. We introduce a novel fast and highly accurate algorithm RecBic to identify various forms of complex biclusters in gene expression datasets. Results We designed RecBic to identify various trend-preserving biclusters, particularly, those with narrow shapes, i.e. clusters where the number of genes is larger than the number of conditions/samples. Given a gene expression matrix, RecBic starts with a column seed, and grows it into a full-sized bicluster by simply repetitively comparing real numbers. When tested on simulated datasets in which the elements of implanted trend-preserving biclusters and those of the background matrix have the same distribution, RecBic was able to identify the implanted biclusters in a nearly perfect manner, outperforming all the compared salient tools in terms of accuracy and robustness to noise and overlaps between the clusters. Moreover, RecBic also showed superiority in identifying functionally related genes in real gene expression datasets. Availability and implementation Code, sample input data and usage instructions are available at the following websites. Code: https://github.com/holyzews/RecBic/tree/master/RecBic/. Data: http://doi.org/10.5281/zenodo.3842717. Supplementary information Supplementary data are available at Bioinformatics online.
more » « less
Full Text Available
TransComb: genome-guided transcriptome assembly via combing junctions in splicing graphs

https://doi.org/10.1186/s13059-016-1074-1

Liu, Juntao; Yu, Ting; Jiang, Tao; Li, Guojun (December 2016, Genome Biology)

Full Text Available
ProSampler: an ultrafast and accurate motif finder in large ChIP-seq datasets for combinatory motif discovery

https://doi.org/10.1093/bioinformatics/btz290

Li, Yang; Ni, Pengyu; Zhang, Shaoqiang; Li, Guojun; Su, Zhengchang; Berger, ed., Bonnie (May 2019, Bioinformatics)

Abstract MotivationThe availability of numerous ChIP-seq datasets for transcription factors (TF) has provided an unprecedented opportunity to identify all TF binding sites in genomes. However, the progress has been hindered by the lack of a highly efficient and accurate tool to find not only the target motifs, but also cooperative motifs in very big datasets. ResultsWe herein present an ultrafast and accurate motif-finding algorithm, ProSampler, based on a novel numeration method and Gibbs sampler. ProSampler runs orders of magnitude faster than the fastest existing tools while often more accurately identifying motifs of both the target TFs and cooperators. Thus, ProSampler can greatly facilitate the efforts to identify the entire cis-regulatory code in genomes. Availability and implementationSource code and binaries are freely available for download at https://github.com/zhengchangsulab/prosampler. It was implemented in C++ and supported on Linux, macOS and MS Windows platforms. Supplementary informationSupplementary materials are available at Bioinformatics online.
more » « less

Search for: All records